Robust Stein estimator for overcoming outliers and multicollinearity

Correlated regressors in linear regression models can severely degrade the performance of the ordinary least squares (OLS) estimator. The Stein and ridge estimators have been proposed as alternatives that improve estimation accuracy under multicollinearity. However, both methods are non-robust to outliers. In previous studies, the M-estimator has been combined with the ridge estimator to address correlated regressors and outliers jointly. In this paper, we introduce the robust Stein estimator to address both issues simultaneously. Our simulation and application results demonstrate that the proposed technique performs favorably compared with existing methods.

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{p^*}$ are the ordered eigenvalues of $X'X$ and $T$ is a $(p^* \times p^*)$ orthogonal matrix whose columns are the corresponding eigenvectors. Rewriting the linear regression model in Eq. (1.1) in canonical form gives

$$y = H\alpha + \varepsilon, \qquad H = XT, \quad \alpha = T'\beta, \quad T'X'XT = H'H = \Lambda,$$

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{p^*})$. In the presence of correlated regressors (multicollinearity), the ordinary least squares estimator $\hat{\alpha}_{OLS}$ is inadequate and inefficient. Outliers likewise distort the parameter estimates of $\hat{\alpha}_{LS}$. The M-estimator handles outliers in the y-direction efficiently15. Let $\hat{\alpha}_M$ denote the M-estimator of $\alpha$; it is obtained as a solution of the M-estimating equations. The effects of outliers in the y-direction are eliminated by the residual weights in the iteratively reweighted least squares algorithm used to solve the M-estimating equations10,15.
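For concreteness, the canonical reduction can be carried out numerically; the following minimal R sketch uses an arbitrary illustrative design matrix (all names are ours, not from the paper):

```r
# A minimal sketch of the canonical reduction; X is an illustrative design matrix.
set.seed(1)
n <- 50; p_star <- 4
X <- matrix(rnorm(n * p_star), n, p_star)

eig    <- eigen(crossprod(X), symmetric = TRUE)
T_mat  <- eig$vectors                  # orthogonal: columns are eigenvectors of X'X
lambda <- eig$values                   # ordered lambda_1 >= ... >= lambda_p*

H <- X %*% T_mat                       # canonical regressors
max(abs(crossprod(H) - diag(lambda)))  # H'H = Lambda, up to rounding error
```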
where $\pi(\cdot)$ denotes a robust criterion function and $\eta$ is a scale parameter estimate. $\hat{\alpha}_M$ is obtained as the solution of the M-estimating equations

$$\sum_{i=1}^{n} \phi\!\left(\frac{e_i}{\eta}\right) = 0 \quad \text{and} \quad \sum_{i=1}^{n} \phi\!\left(\frac{e_i}{\eta}\right) x_i = 0,$$

where $e_i = y_i - \sum_{j=1}^{p^*} \hat{\alpha}_{j,M} h_{ij}$ and $\phi = \pi'$ is a suitably chosen function10.
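The iteratively reweighted least squares solution of these equations can be sketched as follows; Huber's $\phi$ with tuning constant c = 1.345 and the MAD scale estimate are common defaults, assumed here rather than taken from the paper:

```r
# Iteratively reweighted least squares for the Huber M-estimator: a minimal
# sketch; c = 1.345 and the MAD scale are common defaults, not the paper's.
huber_irls <- function(X, y, c = 1.345, tol = 1e-8, max_iter = 50) {
  beta <- qr.solve(X, y)                       # start from the OLS solution
  for (it in seq_len(max_iter)) {
    e   <- as.vector(y - X %*% beta)
    eta <- mad(e)                              # robust scale estimate
    u   <- e / eta
    w   <- ifelse(abs(u) <= c, 1, c / abs(u))  # Huber weights downweight outliers
    beta_new <- qr.solve(sqrt(w) * X, sqrt(w) * y)  # weighted LS step
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  beta
}
```

In practice, the same scheme is available through MASS::rlm().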
where $\omega_{jj}$ is the $j$th element of the main diagonal of the matrix $\mathrm{Var}(\hat{\alpha}_M) = \Omega$, which is finite. The ridge regression estimator of $\alpha$ is defined as

$$\hat{\alpha}_{\mathrm{Ridge}} = (\Lambda + kI_{p^*})^{-1} H'y, \qquad k > 0.$$

The matrix mean squared error (MMSE) of $\hat{\alpha}_{\mathrm{Ridge}}$ and the scalar mean squared error (SMSE), its trace, are calculated as

$$\mathrm{MMSE}(\hat{\alpha}_{\mathrm{Ridge}}) = \sigma^2 (\Lambda + kI)^{-1} \Lambda (\Lambda + kI)^{-1} + k^2 (\Lambda + kI)^{-1} \alpha\alpha' (\Lambda + kI)^{-1},$$

$$\mathrm{SMSE}(\hat{\alpha}_{\mathrm{Ridge}}) = \sigma^2 \sum_{j=1}^{p^*} \frac{\lambda_j}{(\lambda_j + k)^2} + k^2 \sum_{j=1}^{p^*} \frac{\alpha_j^2}{(\lambda_j + k)^2},$$

recalling that $T'X'XT = \Lambda = \mathrm{diag}(\lambda_j)$, $j = 1, 2, \ldots, p^*$, with $p^* = p + 1$, and where $\omega_{jj}$ is the $j$th element of the main diagonal of $\mathrm{Var}(\hat{\alpha}_M) = \Omega$.

Proof:
The difference between $\mathrm{SMSE}(\hat{\alpha}_{M\text{-}JSE})$ and $\mathrm{SMSE}(\hat{\alpha}_{LS})$ is given in Eq. (2.28). It is obvious from Eq. (2.28) that $\sigma^2(\omega_{jj} + \lambda_j \alpha_j^2)$ is greater than $\omega_{jj} \alpha_j^2 \lambda_j$. Thus, the difference is less than zero and the proof is completed.
where $\omega_{jj}$ is the $j$th element of the main diagonal of the matrix $\mathrm{Var}(\hat{\alpha}_M) = \Omega$.

Proof:
The difference between $\mathrm{SMSE}(\hat{\alpha}_{M\text{-}JSE})$ and $\mathrm{SMSE}(\hat{\alpha}_{\mathrm{Ridge}})$ is given in Eq. (2.30). It is obvious from Eq. (2.30) that $\sigma^2 \lambda_j \alpha_j^2 (\lambda_j + \omega_{jj}) + k^2 \alpha_j^4 \lambda_j$ is greater than $\omega_{jj} \alpha_j^2 \lambda_j (\lambda_j + 2k)$. Thus, the difference is less than zero and the proof is completed. Similarly, it is obvious from Eq. (2.32) that $\omega_{jj}^2 \lambda_j + k^2 \alpha_j^2 \omega_{jj} + k^2 \alpha_j^4 \lambda_j$ is greater than $\omega_{jj} \alpha_j^2 k (k + 2\lambda_j)$. Thus, the difference is less than zero and the proof is completed.
where $\omega_{jj}$ is the $j$th element of the main diagonal of the matrix $\mathrm{Var}(\hat{\alpha}_M) = \Omega$.

Proof:
The difference between $\mathrm{SMSE}(\hat{\alpha}_{M\text{-}JSE})$ and $\mathrm{SMSE}(\hat{\alpha}_{JSE})$ follows in the same way; the difference is less than zero, and the proof is completed.
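To make the quantities compared in these proofs concrete, the following R sketch evaluates the ridge estimator and its SMSE in canonical coordinates, using the standard canonical-form expressions in our own notation:

```r
# Ridge estimation in canonical coordinates and its scalar MSE:
# a sketch using the standard canonical-form expressions (our notation).
ridge_canonical <- function(H, y, lambda, k) {
  drop(crossprod(H, y) / (lambda + k))  # (Lambda + kI)^{-1} H'y, Lambda diagonal
}

smse_ridge <- function(lambda, alpha, sigma2, k) {
  # variance part + squared-bias part
  sigma2 * sum(lambda / (lambda + k)^2) + k^2 * sum(alpha^2 / (lambda + k)^2)
}

# Sanity check: k = 0 recovers the OLS variance sum(sigma2 / lambda_j), e.g.
# smse_ridge(lambda, alpha, sigma2 = 1, k = 0) equals sum(1 / lambda).
```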

Simulation study
This section provides a simulation study using the R programming language to compare the performance of the non-robust and robust estimators.
Simulation design. The design of this simulation study is based on specifying the factors that are anticipated to affect the properties of the suggested estimator and selecting a metric to assess the outcomes. Following the cited references24-28, we generated the regressors as

$$x_{ij} = (1 - \rho^2)^{1/2} m_{ij} + \rho\, m_{i,p^*+1}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, p^*,$$

where the $m_{ij}$ are independent standard normal pseudo-random numbers, $p^*$ denotes the number of regressors ($p^*$ = 4, 8, 12), and $\rho$ denotes the level of multicollinearity ($\rho$ = 0.7, 0.8, 0.9, 0.99). The response variable is then generated from the linear model, and $h = 10$ is added to a specified percentage of the responses to inflate them and create outliers in the y-direction36,37.

The ridge parameter $k$ is obtained using $\hat{\sigma}^2 = \sum_{i=1}^{n} e_i^2 / (n - r)$, where $e_i = y_i - \hat{y}_i$ and $r$ denotes the number of estimated parameters. The estimator of $\omega_{jj}$ is asymptotically unbiased; thus, the shrinkage parameter for the M-ridge is determined from the same equation with the OLS quantities replaced by their M-estimation counterparts.

The estimated mean squared error (MSE) is computed as

$$\widehat{\mathrm{MSE}}(\hat{\beta}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{p^*} (\hat{\beta}_{ij} - \beta_j)^2,$$

where $N$ is the number of replications, $\hat{\beta}_{ij}$ is the estimate of the $j$th parameter in the $i$th replication, and $\beta_j$ is the $j$th true parameter value. The estimated MSE values of the proposed and competing estimators are displayed in Tables 1, 2, 3, 4, 5 and 6 for $p^*$ = 4, 8 and 12 with 10% outliers and $p^*$ = 4, 8 and 12 with 20% outliers, respectively.

The proposed estimator $\hat{\alpha}_{M\text{-}JSE}$ consistently exhibits the lowest MSE values across all simulation settings, surpassing both the OLS estimator and the other biased estimators. To investigate the impact of outliers on the estimated regression parameters, we considered two percentages of outliers in the y-direction; as the percentage increases from 10 to 20%, the MSE of all estimators increases correspondingly. To assess the influence of multicollinearity, we varied the correlation between explanatory variables ($\rho$ = 0.7, 0.8, 0.9, 0.99): increasing the correlation resulted in higher MSE values for all estimators. Evaluating the estimators across sample sizes ($n$ = 30, 50, 100, 200) while keeping $p$, the percentage of outliers, and $\sigma$ fixed revealed a clear trend: the MSE consistently decreased as the sample size grew. The parameter $\sigma$ also had a significant impact, with larger $\sigma$ producing higher MSE for all estimators; likewise, a higher number of explanatory variables resulted in higher MSE values. Under all simulation conditions, the proposed estimator is the most effective choice for mitigating multicollinearity in the presence of outliers.
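A compact R sketch of this Monte Carlo design follows; the replication count, true coefficients, and the plugged-in estimator are illustrative choices of ours:

```r
# Monte Carlo design sketch following the generation scheme described above;
# n_rep, beta_true and the default (OLS) estimator are illustrative choices.
sim_mse <- function(n = 50, p_star = 4, rho = 0.9, sigma = 1,
                    out_pct = 0.10, h = 10, n_rep = 1000,
                    estimator = function(X, y) qr.solve(X, y)) {
  beta_true <- rep(1, p_star)                    # illustrative true parameters
  mse <- 0
  for (r in seq_len(n_rep)) {
    M <- matrix(rnorm(n * (p_star + 1)), n, p_star + 1)
    X <- sqrt(1 - rho^2) * M[, 1:p_star] + rho * M[, p_star + 1]
    y <- X %*% beta_true + rnorm(n, sd = sigma)
    idx <- sample(n, size = floor(out_pct * n))  # contaminate a fraction of
    y[idx] <- y[idx] + h                         # responses with h = 10
    b <- estimator(X, y)
    mse <- mse + sum((b - beta_true)^2)
  }
  mse / n_rep                                    # estimated MSE
}
```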

Real-life application
In this section, we present three examples to evaluate the performance of the estimators.
Example I. We utilized a pollution dataset that has been previously analyzed by various researchers38,39. The response variable is the total age-adjusted mortality rate per 100,000, modeled as a linear function of 15 covariates. For a more detailed description of the data, see refs. 38,39.
First, we employed the least squares method to fit model (1.1) and obtained the residuals. The diagnostic plots in Fig. 4, constructed from these residuals, indicated that certain observations were outliers. Specifically, the residuals versus fitted plot identified data points 26, 31, and 37 as outliers; the normal Q-Q plot flagged data points 26, 32, and 37; the residuals versus leverage plot identified observations 18, 32, and 37; and the scale-location plot picked out observations 32 and 37. These observations reveal that the data contain outliers. To address the issues of correlated regressors and outliers, we estimated the model using ridge regression, the Stein estimator, the M-ridge, and the proposed robust Stein estimator. We compared the performance of these estimators using the scalar mean squared error (SMSE); the regression estimates and SMSE values are provided in Table 7.
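The diagnostic step above can be reproduced for any fitted model; a minimal R sketch, in which `pollution` and `mortality` are placeholder names for the dataset and response described in the text:

```r
# Reproducing OLS residual diagnostics of the kind shown in Fig. 4;
# `pollution` and `mortality` are placeholder names, not the paper's code.
fit <- lm(mortality ~ ., data = pollution)  # fit model (1.1) by least squares
par(mfrow = c(2, 2))
plot(fit)  # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
# Points labelled in the corners of these four plots are candidate outliers
# and/or high-leverage observations.
```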
From Table 7, we observed that, owing to the sensitivity of the OLS estimator to correlated regressors (multicollinearity) and outliers, it exhibited the worst performance in terms of SMSE. The coefficients of all the estimators were similar, except for x6, where only the M-ridge and M-Stein had a positive coefficient. As expected, the robust ridge dominated the ridge estimator, since the ridge estimator is sensitive to outliers. However, the Stein estimator performed better than the ridge estimator, as reported in the literature. Most notably, the proposed robust version of the Stein estimator (M-JSE) outperformed every estimator under study.

Example II. The results of the second example are presented in Table 8. It was observed that the regression estimate of the Stein estimator was essentially the same as that of OLS, with a computed value of c approximately equal to 1 (c = 0.9996761). However, the Stein estimator exhibited a lower mean squared error than the OLS estimator. The ridge estimator dominated the Stein estimator in this instance, but the M-ridge outperformed the ridge estimator by accounting for both multicollinearity and outliers. The proposed M-JSE performed best, with the smallest MSE.
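Since the behavior of the Stein estimator here is governed entirely by the scalar shrinkage factor c, a short sketch may help; the expression for c below is one common Stein-type choice and is our assumption, not necessarily the exact formula used in the paper:

```r
# The Stein-type estimator shrinks OLS by a scalar c; this sketch uses one
# common choice of c (the paper's exact expression may differ).
stein_c <- function(X, y) {
  fit    <- lm(y ~ X - 1)
  b      <- coef(fit)
  sigma2 <- summary(fit)$sigma^2
  q      <- drop(t(b) %*% crossprod(X) %*% b)  # b' X'X b
  q / (q + length(b) * sigma2)                 # c in (0, 1); near 1 for strong fits
}
# alpha_JSE <- stein_c(X, y) * alpha_OLS
```

A value of c this close to 1 means the Stein estimate is numerically indistinguishable from OLS, which is consistent with the identical coefficient estimates reported in Table 8.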

Example III
We analyzed the Longley data to predict the total derived employment as a linear function of the following predictors: gross national product implicit price deflator, gross national product, unemployment, size of armed forces, and non-institutional population 14 years of age and over33. Figure 6 shows that certain observations are anomalous, namely data points 9, 10, and 16.
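Because the Longley data ship with base R (`datasets::longley`), the multicollinearity and the flagged observations can be checked directly; a minimal sketch:

```r
# The Longley data are built into base R, so the severity of the
# multicollinearity reported in the text can be verified directly.
data(longley)
fit <- lm(Employed ~ GNP.deflator + GNP + Unemployed +
            Armed.Forces + Population, data = longley)
X <- model.matrix(fit)[, -1]   # predictor matrix without the intercept
kappa(cor(X), exact = TRUE)    # condition number of the correlation matrix;
                               # a very large value signals severe multicollinearity
plot(fit, which = 2)           # normal Q-Q plot: points 9, 10 and 16 stand out
```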
We used both robust and non-robust estimators to analyze the data, and the results are presented in Table 9. The table indicates that the regression estimates of OLS and Stein are the same, with a value of c = 1. However, the Stein estimator has a lower SMSE than OLS. The Stein estimator dominates the ridge and robust ridge estimators in this instance. Furthermore, the proposed robust Stein estimator provides optimal performance based on the results.
In summary, the Longley data analysis indicates that the model suffers from multicollinearity and contains anomalous observations. However, using the robust Stein estimator provides the best performance among the estimators considered in this study.

Some concluding remarks
Linear regression models (LRMs) are widely used for predicting a response variable from a combination of regressors. However, correlated regressors can decrease the efficiency of the ordinary least squares method. Alternative methods such as the Stein and ridge estimators can provide better estimates in such situations. However, these methods can be sensitive to outlying observations, leading to unstable predictions.
To address this issue, researchers have previously combined the ridge estimator with robust estimators (such as M-estimators) to account for both correlated regressors and outliers.
In this study, we developed a new biased estimator that offers an alternative approach to handling multicollinearity in linear regression: a robust Stein estimator obtained by combining the M-estimator with the Stein estimator. Pseudo-random numbers were generated for both the independent and dependent variables in a Monte Carlo experiment, taking into account different sample sizes, correlation strengths, and numbers of independent variables. Our simulation and application results demonstrate that the robust Stein estimator outperforms the other estimators considered.
Notably, under high multicollinearity the suggested estimator performed best, as measured by the reduction in estimated MSE values, and it is less affected by multicollinearity than the other estimators. According to the tables, the performance of the suggested estimators differs with the shrinkage parameter used, and it may be concluded that $k_m$ is the best shrinkage parameter in most cases. The findings of this paper will be beneficial for practitioners who face multicollinearity and outliers in their data: by using the robust Stein estimator, they can obtain more stable and accurate predictions.
While this study has made substantial progress in addressing the challenges of LRMs, there are still avenues for further exploration. Future research should consider incorporating other robust estimators, including the robust Liu estimator, the robust Liu-type estimator, the robust linearized ridge estimator, the jackknife Kibria-Lukman M-estimator, and the modified ridge-type M-estimator, to conduct a more comprehensive comparative analysis13,14,45-47. This will contribute to a deeper understanding of the strengths and limitations of different approaches in handling complex data scenarios.
Another potential direction for future research is the extension of the current study using neutrosophic statistics. Neutrosophic statistics is an extension of classical statistics that is particularly useful when dealing with data from complex processes or uncertain environments48-53. By incorporating neutrosophic statistics, we can account for additional sources of uncertainty and variability, which may further enhance the robustness and applicability of our proposed estimator.

Figure 6. Graphical detection of outliers using the Longley data.